AD-A059 936

UNCLASSIFIED

SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE

1. REPORT NUMBER: AFOSR-TR-78-1355

4. TITLE: ANALYSIS AND DESIGN OF FAULT-TOLERANT COMPUTER SYSTEMS

5. TYPE OF REPORT & PERIOD COVERED: Interim

7. AUTHOR: John P. Hayes

8. CONTRACT OR GRANT NUMBER: AFOSR-77-3352

9. PERFORMING ORGANIZATION NAME AND ADDRESS

University of Southern California
Electronic Sciences Laboratory
Los Angeles, California 90007

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS: 61102F

11. CONTROLLING OFFICE NAME AND ADDRESS

Air Force Office of Scientific Research/NM
Bolling AFB, Washington, DC 20332

12. REPORT DATE: August 1, 1978

13. NUMBER OF PAGES: 26

15. SECURITY CLASS (of this report): UNCLASSIFIED

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)

Bit-sliced microprocessors
Communication networks
Connecting networks
Fault diagnosis
Fault-tolerant computing
Graph models
Microprocessors
Multiprocessors
Recovery
Test generation


20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

This report describes the first-year results of an investigation of
fault-tolerant computer systems. A new method for measuring recovery time
in fault-tolerant multiprocessors was developed. A complete characterization
of optimally t-step recoverable systems was obtained, and certain graph
transformations that simplify recovery analysis were studied. Some
diagnosability properties of n-cube interconnection networks were derived.
A study of fault tolerance in large connecting networks was initiated using
a new concept of dynamic full access. A design theory based on recursive
component expansion capabilities was developed for MSI/LSI systems. The
use of similar recursive methods for test pattern generation was also
initiated. Promising results were obtained for testing bit-sliced
microprocessors and related components.

AFOSR-TR-78-1355


1977-78 Annual Technical Report


Air Force Office of Scientific Research
Grant No. AFOSR-77-3352


ANALYSIS  AND  DESIGN  OF  FAULT-TOLERANT  COMPUTER  SYSTEMS 


Prepared 

by 

John  P.  Hayes 


Electronic  Sciences  Laboratory 
University  of  Southern  California 
Los  Angeles,  California  90007 


August  1,  1978 


Approved for public release;
distribution unlimited.


Abstract

1. Research objectives

2. Research accomplishments

   2.1 Recovery modeling in multiprocessor systems

   2.2 Communication networks for multi-microprocessors

   2.3 Design and testing of MSI and LSI systems

   2.4 References

3. Publications

4. Personnel

5. Interactions

6. Summary and future plans

Appendix: "Fault recovery in multiprocessor networks"


ABSTRACT 


This report describes the first-year results of an investigation of
fault-tolerant computer systems. A new method for measuring recovery
time in fault-tolerant multiprocessors was developed. A complete
characterization of optimally t-step recoverable systems was obtained, and
certain graph transformations that simplify recovery analysis were studied.
Some diagnosability properties of n-cube interconnection networks were
derived. A study of fault tolerance in large connecting networks was
initiated using a new concept of dynamic full access. A design theory based
on recursive component expansion capabilities was developed for MSI/LSI
systems. The use of similar recursive methods for test pattern generation
was also initiated. Promising results were obtained for testing bit-sliced
microprocessors and related components.

1. RESEARCH OBJECTIVES


The purpose of this research project is to develop methods for the
analysis and synthesis of complex fault-tolerant computer systems. It
is motivated by recent rapid developments in large-scale integration (LSI)
technology, especially the introduction of microprocessors, which are
expected to increase greatly the use of multiple computer systems that are
required to be highly reliable. The research is particularly concerned with
dynamic reconfiguration and recovery in the event of failures, topics
which have received relatively little research attention in the past.
It is intended to develop specific measures of the cost and complexity of
reconfiguration and recovery, and to derive efficient fault tolerance
algorithms based on these measures. Various graph theoretical and
algebraic tools are used in this research, with the facility graph model
[1], developed by the Principal Investigator, serving as a starting point.
The special problems associated with the design of systems containing
many microprocessors, particularly the problem of interprocessor
communication, are also being investigated.


2.  RESEARCH  ACCOMPLISHMENTS 


During  1977-78  results  were  obtained  in  three  main  areas: 

(1)  Recovery  modeling  in  multiprocessor  systems 

(2)  Communication  networks  for  multi-microprocessors 

(3)  Design  and  testing  of  MSI  and  LSI  systems 

These  results  are  described  in  detail  in  the  following  subsections. 

2.1 Recovery modeling in multiprocessor systems [2, 3]¹

A new method for characterizing the recovery time of fault-tolerant
multiprocessor systems was developed. The system is represented by a
facility graph G_r in which nodes correspond to processors and edges
correspond to communication links [1]. The fault-free nodes include nodes
actively engaged in data processing and nodes acting as standby spares.
A fault is represented by the removal of a node and its associated edges
from G_r. Faults are tolerated by reconfiguring the pattern of active and
spare nodes in G_r so that there always exists an active subnetwork that is
isomorphic to (that is, has the same logical interconnection structure as)
a certain minimum configuration G_b called the basic system. G_b can be
taken as the minimum fault-free system needed to perform a particular set
of tasks.

A system G_r is called k-fault-tolerant (k-FT) t-step recoverable (t-SR)
if it can recover from up to k faults by changing the states of at most t
fault-free nodes. The parameter k is clearly a measure of the amount of
damage the system can tolerate. A state change, e.g., from spare to active,
typically involves the establishment of new logical paths in the system, and
the transfer of programs and data between the affected nodes. If n state
changes of average duration c are required to recover from a particular
fault, then nc is the total recovery time. Thus the parameter t defined
above is proportional to the maximum recovery time required by G_r.

¹ Reference [3] forms an appendix to this report.

Clearly t ≥ k. A case of particular interest, corresponding to a
class of systems with minimum recovery time, is where t = k. In such
systems recovery from t faults is achieved by immediate replacement of
each failed node by a fault-free spare. G_r is defined to be optimally
t-SR with respect to an n-node basic system G_b if

(1) G_r is t-FT/t-SR with respect to G_b

(2) G_r contains the minimum number of nodes, viz. n + t

(3) G_r contains the fewest edges among all systems satisfying
conditions (1) and (2)

In [3] we prove that the optimal t-SR realization of every G_b is unique,
and that it has a surprisingly simple structure. Figure 1a shows an
example of a basic graph I_b consisting of four processors arranged in a
ring. Figure 1b shows the corresponding optimal 2-SR graph I_2^OPT. It
consists of I_b with two additional spare nodes, labeled s_1 and s_2, and
additional edges connecting s_1 and s_2 to all nodes, including each other.
Every fault graph formed by removing one or two nodes from I_2^OPT contains
a subgraph isomorphic to I_b (the 2-FT property). Furthermore, each such
subgraph can be chosen so that it differs from the original active subgraph
in at most two nodes (the 2-SR property).
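The ring example can be checked mechanically. The following Python sketch is an illustration added here, not the procedure used in [3]: it builds the graph by attaching t spares to every other node and then tests, by brute force, that every set of one or two failed nodes leaves an assignment of the four ring positions that preserves all ring edges and re-fills at most two positions. The function names, the integer node labels (ring nodes 0 to 3, spares 4 and 5), and the choice to count re-filled positions as the recovery cost are assumptions made for this sketch.

```python
from itertools import combinations, permutations

def optimal_tsr_edges(basic_edges, n, t):
    """Edges of the basic graph plus t spares joined to every other node."""
    edges = {frozenset(e) for e in basic_edges}
    for spare in range(n, n + t):
        for other in range(n + t):
            if other != spare:
                edges.add(frozenset((spare, other)))
    return edges

def is_t_step_recoverable(basic_edges, n, t, edges):
    """Brute-force check over every set of 1..t failed nodes.

    Position i of the basic graph is initially held by node i. A fault is
    tolerated if the surviving nodes admit an assignment that preserves all
    basic edges and re-fills at most t positions.
    """
    total = n + t
    for k in range(1, t + 1):
        for failed in combinations(range(total), k):
            alive = [v for v in range(total) if v not in failed]
            if not any(
                all(frozenset((img[u], img[v])) in edges for u, v in basic_edges)
                and sum(node != pos for pos, node in enumerate(img)) <= t
                for img in permutations(alive, n)
            ):
                return False
    return True

ring = [(0, 1), (1, 2), (2, 3), (3, 0)]      # four processors in a ring
graph = optimal_tsr_edges(ring, n=4, t=2)    # spares are nodes 4 and 5
print(len(graph), is_t_step_recoverable(ring, 4, 2, graph))
```

For the four-node ring the construction yields 13 edges. Deleting the single edge between the two spares makes the check fail, which suggests why the spares must also be connected to each other. The search is exponential and is only practical for very small graphs.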

Optimal t-SR systems have the disadvantage that the number of edges
connected to some nodes (the node degree) may be very large. Since this
represents the number of parallel data paths to a processor, it is often
severely restricted by physical considerations, for example, microprocessor
pin limitations. Thus nonoptimal fault-tolerant systems with limited node
fanout are of interest. We have investigated a class of graph
transformations, called line graph transformations, which lead to t-SR
designs with nodes of lower degree than the corresponding optimal t-SR
systems [3]. We have also shown that line graph transformations greatly
simplify the computation of the parameters k and t.


2.2 Communication networks for multi-microprocessors

An extensive survey of systems containing many microprocessors was
completed. Two major communications structures for such systems were
identified: the hierarchical bus organization represented by Cm* [5], and
the n-cube organization proposed by several researchers [6, 7]. Most of
the published work in this area deals with unimplemented paper designs
with little analytical basis. System reliability and fault tolerance have
also been largely ignored.

A network organization with a relatively sound analytical basis is
the binary n-cube structure [7]. This contains 2^n processors whose logical
interconnection structure can be represented by an n-dimensional cube.
Figure 2a shows the structure of the 3-cube. We have investigated several
aspects of the fault tolerance of n-cube networks. Using the approach of
Preparata et al. [8] we have shown that the diagnosability of an n-cube
system is n for n ≥ 3, where the diagnosability of a system is defined
as the largest number k such that the system is one-step k-fault
diagnosable [4].
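The restriction to n ≥ 3 can be made plausible with a small sketch. It assumes that each processor tests its n neighbors, and it checks only the two well-known necessary conditions of the Preparata et al. model for one-step t-fault diagnosability: the system must contain at least 2t + 1 units, and every unit must be tested by at least t others. These conditions are not sufficient, so the code illustrates the bound rather than proving the result in [4]. The function names are hypothetical.

```python
def n_cube(n):
    """Adjacency lists of the binary n-cube: neighbors differ in exactly one bit."""
    return {v: [v ^ (1 << b) for b in range(n)] for v in range(2 ** n)}

def meets_necessary_conditions(adjacency, t):
    """Necessary (not sufficient) conditions for one-step t-fault diagnosability."""
    enough_units = len(adjacency) >= 2 * t + 1
    enough_testers = all(len(nbrs) >= t for nbrs in adjacency.values())
    return enough_units and enough_testers

cube3 = n_cube(3)
print(len(cube3), sorted(cube3[0]))          # 8 processors; node 0 adjoins 1, 2, 4
print(meets_necessary_conditions(cube3, 3))  # 8 >= 7 and every degree is 3
print(meets_necessary_conditions(n_cube(2), 2))  # fails: only 4 < 5 units
```

Since every node has degree n, diagnosability cannot exceed n, and the unit-count condition 2^n ≥ 2n + 1 first holds at n = 3.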

N-cube arrays can be implemented using connecting networks of the
type long used in telephone exchanges [9]. Figure 2b shows one such
implementation of the 3-cube using twelve switches denoted S. Each S may
be considered to have two states, the "through" and "cross" states depicted
in Figure 2c. We have begun investigating the fault tolerance properties
of connecting networks of this kind. A study of actual circuits used for
S [10] indicates that most faults in the network can be modeled by switches
that are stuck at the through state (s-a-T) or stuck at the cross state
(s-a-X).

We have defined a connecting network N to have the dynamic full access
property if each processor P_i can be connected to any other processor P_j
via a finite (but unspecified) number of passes through the connecting
network. This is a generalization of the usual full access property [9].
N is said to be k-fault tolerant (k-FT) with respect to the foregoing s-a-T/X
fault model if the failure of k or fewer switches in N does not destroy the
dynamic full access property. We have begun investigating the conditions
for N to be k-FT. It is hoped that this work will lead to methods for
designing efficient and fault-tolerant communication networks for large
multi-microprocessor systems.
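The definition of dynamic full access can be illustrated with a short Python sketch. The wiring of Figure 2b is not recoverable from the text, so the sketch assumes an omega (shuffle-exchange) network of three stages with four two-state switches each, which also uses twelve switches for eight processors. The names `one_pass_reach` and `has_dynamic_full_access` and the `faults` dictionary, which maps a (stage, switch) pair to "T" or "X", are hypothetical.

```python
def one_pass_reach(src, faults, n_bits=3):
    """Outputs reachable from input `src` in one pass through the network.

    A fault-free switch may be set either way; a switch stuck at 'T' only
    passes straight through and one stuck at 'X' only crosses.
    """
    size = 1 << n_bits
    positions = {src}
    for stage in range(n_bits):
        nxt = set()
        for p in positions:
            q = ((p << 1) | (p >> (n_bits - 1))) & (size - 1)  # perfect shuffle
            state = faults.get((stage, q >> 1))                # None if fault-free
            if state != "X":
                nxt.add(q)          # through state keeps the line
            if state != "T":
                nxt.add(q ^ 1)      # cross state swaps the line pair
        positions = nxt
    return positions

def has_dynamic_full_access(faults, n_bits=3):
    """True if every processor reaches every other in some number of passes."""
    size = 1 << n_bits
    step = {p: one_pass_reach(p, faults, n_bits) for p in range(size)}
    for src in range(size):
        seen, frontier = set(), {src}
        while frontier:
            frontier = set().union(*(step[p] for p in frontier)) - seen
            seen |= frontier
        if len(seen | {src}) < size:
            return False
    return True

print(has_dynamic_full_access({}))              # fault-free network
print(has_dynamic_full_access({(0, 0): "X"}))   # one switch stuck at cross
```

In this assumed network a single switch stuck at cross in the first stage halves the one-pass reach of two inputs, yet repeated passes still connect every pair. If all twelve switches are stuck at through, each processor can reach only itself and the property is lost.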

2.3 Design and testing of MSI and LSI systems [11, 12]

Most existing analytical tools are inadequate for dealing with
digital components above the gate and flip-flop levels, which correspond
to small-scale integration (SSI) in current technology. There is at present
no adequate theory for the design or testing of MSI and LSI devices,
although the need for such a theory has long been recognized. Perhaps
the only LSI device for which a promising theory of testing is emerging
is the semiconductor random access memory (RAM) [13].

We have observed that a significant property of components at all
complexity levels is expansibility, which is the ability of components of
a given type to be interconnected in a systematic way to form larger
components of the same type [12]. The larger component performs the same
operation as its constituent elements, but processes more and/or bigger
operands. Many MSI and LSI design rules are merely recipes for component
expansion, e.g., how to build a 1-out-of-N decoder using 1-out-of-n
decoders where N > n, or how to build an N x M RAM using n x m RAM IC's
where N > n or M > m [14]. Expansibility plays a particularly important
role in the architecture of microcomputers. The major design problems
revolve around the number, size and interconnections of the ROM's, RAM's
and IO interface circuits used, problems which are intimately associated with
the expansibility of these components. With bit-slice architecture the CPU
(microprocessor) becomes an expandable design component. Two main expansion
techniques have been identified, expansion by composition and by replication
[12]. Expansion methods, which correspond to design rules, can be concisely
defined by recursive equations. For example, a typical MSI component, a
ripple-carry adder, can be defined as follows:


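Basis:

$$\mathrm{ADD}^{1}(x_0, y_0, c_{in}) = \bigl(x_0 y_0 + x_0 c_{in} + y_0 c_{in},\; x_0 \oplus y_0 \oplus c_{in}\bigr)$$

$$\mathrm{ADD}^{n+1}(x_{0:n}, y_{0:n}, c_{in}) = \bigl(\mathrm{ADD}^{1}(x_0, y_0, c'_{in}),\; z_{1:n}\bigr), \quad \text{where } (c'_{in}, z_{1:n}) = \mathrm{ADD}^{n}(x_{1:n}, y_{1:n}, c_{in})$$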
Here x_i and y_i denote input data lines, and c_in denotes a carry line.
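A recursive definition of this kind translates directly into a program. The Python sketch below is an illustration under stated assumptions rather than a transcription of the report's equations: it takes x_0 as the most significant bit, lets the carry ripple from the n-bit block into the single-bit stage, and returns a (carry, sum bits) pair. The function names `add1` and `add` are hypothetical.

```python
def add1(x, y, c_in):
    """One-bit full adder: returns (carry_out, sum)."""
    carry = (x & y) | (x & c_in) | (y & c_in)
    return carry, x ^ y ^ c_in

def add(xs, ys, c_in):
    """Ripple-carry adder on equal-length bit tuples, most significant bit first.

    An (n+1)-bit adder is one full adder placed in front of an n-bit adder,
    with the carry out of the n-bit block feeding the single stage.
    """
    if len(xs) == 1:
        carry, s = add1(xs[0], ys[0], c_in)
        return carry, (s,)
    carry, low = add(xs[1:], ys[1:], c_in)      # n low-order bits first
    carry, high = add1(xs[0], ys[0], carry)     # then the leading bit
    return carry, (high,) + low

print(add((0, 1, 1), (0, 0, 1), 0))   # 3 + 1 = 4
```

The same function serves every word size, which is the point of defining the component by expansion rather than by a fixed-width circuit.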

We have proposed a classification scheme for expansion algorithms based
on three parameters: the presence of feedback, the use of constant inputs
or outputs, and the logical depth of the interconnections used. We have
shown that most standard components can be expanded using FS2 algorithms
which allow neither feedback nor constant input/output values, and which
require two (the minimum number) logic levels. Some other useful
expansion methods have also been identified [12].

We have also demonstrated that recursive techniques can be used for
test pattern generation. As a simple illustration consider the n-input
AND function AND^n. Let T^n(x_0, x_1, ..., x_{n-1}) be a Boolean function
denoting the (unique) set of test patterns for stuck-type faults in AND^n;
T^n(X) = 1 if and only if X is a test pattern. We can define the tests for
AND^n recursively as follows.


Basis:

$$T^{2}(x_0, x_1) = x_0 x_1 + \bar{x}_0 x_1 + x_0 \bar{x}_1$$

$$T^{n+1}(x_0, x_1, \ldots, x_n) = T^{n}(x_0, x_1, \ldots, x_{n-1})\, x_n + x_0 x_1 \cdots x_{n-1}\, \bar{x}_n$$
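The recursion for the AND gate can be exercised with a short Python sketch. It represents the function T^n by the set of input patterns on which it equals 1, builds that set recursively, and then checks by simulation that the set exposes every single stuck-at-0 and stuck-at-1 fault on the inputs and the output. The fault simulator and the function names are assumptions made for this illustration.

```python
def and_tests(n):
    """Test patterns for an n-input AND gate, built by the recursion for T^(n+1)."""
    if n == 2:
        return {(1, 1), (0, 1), (1, 0)}                 # basis T^2
    all_ones = (1,) * (n - 1)
    # Keep every old test with the new input at 1, then add the one pattern
    # that sets only the new input to 0.
    return {p + (1,) for p in and_tests(n - 1)} | {all_ones + (0,)}

def detects_all_stuck_faults(n, tests):
    """True if `tests` exposes every single stuck-at fault on inputs and output."""
    def good(bits):
        return int(all(bits))
    for line in range(n + 1):                # line n stands for the gate output
        for stuck in (0, 1):
            def faulty(bits):
                if line == n:
                    return stuck
                return good(bits[:line] + (stuck,) + bits[line + 1:])
            if not any(faulty(p) != good(p) for p in tests):
                return False
    return True

tests4 = and_tests(4)
print(sorted(tests4), detects_all_stuck_faults(4, tests4))
```

The recursion produces n + 1 patterns for an n-input gate: the all-ones pattern and the n patterns containing a single 0. Dropping any one of them leaves some stuck-at fault undetected, consistent with the description of the test set as unique.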


We  have  started  to  extend  this  test  generation  philosophy  to  obtain  efficient 
and  systematic  test  procedures  for  MSI/LSI  systems.  Besides  leading  to 
analytic  testing  methods,  this  approach  has  the  added  advantage  of  being 
relatively  independent  of  such  factors  as  word  size,  making  it  possible 
to  analyze  all  members  of  a family  of  components  simultaneously. 

We have carried out a study (unpublished) of the feasibility of this
general approach for testing bit-sliced microprocessors. We use as the
basic component the 1-bit processor cell M shown in Figure 3. M has
most of the major features of a commercial bit-sliced microprocessor, such
as the Intel 3002 2-bit processor [14] or the Am2901 4-bit processor [16]
(only the shift function and the status flags have been omitted). It
contains two registers A and T and two complex combinational circuits,
a multiplexer and an arithmetic-logic unit ALU. Using the most general
functional fault model, which allows arbitrary functional changes in the
individual registers and combinational circuits, we have shown that M
can be tested with t ≈ 100 test patterns. Furthermore, a k-bit processor
array constructed from k copies of M can also be tested with t tests,
independent of k, and the array tests can be easily derived from those of
the individual cell.

Figure 3. Processor cell M used for analyzing bit-sliced microprocessors.

5. INTERACTIONS


Meetings  with  Air  Force  Personnel 

J. P. Hayes met with Dr. Joseph Bram, AFOSR Directorate of Mathematical
and Information Sciences, in Los Angeles, on January 30, 1978. Current
progress and future plans for the project being reported here were reviewed.

J. P. Hayes met with Mr. Armand Vito of RADC (ISCA) in Marina Del Rey,
California on April 6, 1978 to discuss research topics of mutual interest.

J. P. Hayes visited RADC, Rome, New York, May 12-13, 1978. He met
with Mr. Murray Kesselman (ISCA) who provided him with a detailed overview
of Air Force research interests in the areas of computer architecture and
fault-tolerant computing. He also met with Lt. Michael Troutman (ISCA)
and discussed the Air Force sponsored Total System Design (TSD) and
Multi-Microprocessor System (MMS) projects. Dr. Hayes had an opportunity
to see some of RADC's research facilities, including its QM-1 and STARAN
computers.

Attendance  at  FTCS-8 

J. P. Hayes and R. Yanney attended the 1978 International Symposium
on Fault-Tolerant Computing (FTCS-8) in Toulouse, France, June 21-23, 1978.
This is the major annual conference on research in fault tolerance.
Approximately 350 researchers from 25 countries attended FTCS-8. The paper
"Fault recovery in multiprocessor networks" (see Appendix) was presented
at this conference.


6.  SUMMARY  AND  FUTURE  PLANS 


We have developed a new model for measuring the recovery time of a
fault-tolerant system based on the facility graph concept. Necessary and
sufficient conditions for an arbitrary system to be k-step recoverable were
obtained. A survey of communication networks for multi-microprocessors
was carried out. The diagnosability of the n-cube interconnection network
was characterized. An analysis of the fault tolerance properties of
connecting networks was initiated using the concept of dynamic full access.
A design theory for MSI/LSI systems based on a formal definition of recursive
expansibility was developed. It was shown that this approach can be used
for test pattern generation for a variety of complex systems including
bit-sliced microprocessors.

In the area of reconfiguration and recovery we propose to investigate
strategies for achieving fault tolerance in distributed systems when the
individual processors have limited information about the system as a whole.
We also intend to study graceful degradation in such systems. We propose
to continue our analysis of communication networks for multi-microprocessors,
with the aim of completely characterizing their fault tolerance properties.
We plan to extend our analysis of bit-sliced microprocessors to include
all the features of real systems. We further aim to extend it to other
bit-sliced components such as microprogram sequencers and RAM's so that
ultimately we can automatically generate a near-optimal test set for complete
microcomputers that use bit-slicing technology. Finally, we hope to use our
knowledge of the test requirements of bit-sliced microcomputers to analyze
non-bit-sliced systems.

APPENDIX: FAULT RECOVERY IN MULTIPROCESSOR NETWORKS

ABSTRACT

A method for characterizing dynamic reconfiguration and recovery in
fault-tolerant networks of processors is proposed. A network is represented
by a graph G_r whose nodes correspond to processors and whose edges
correspond to communication links. Each node or edge has three major states:
active, inactive (spare) and failed. G_r tolerates a fault F by activating
spare nodes and edges to reconfigure around the failed components so that an
active subnetwork isomorphic to a basic system G_b is maintained. G_r is
called k-fault-tolerant (k-FT) t-step recoverable (t-SR) if it can recover
from k or fewer node failures by changing the states of at most t fault-free
nodes, e.g., by activating t spare nodes. Thus t is a measure of system
recovery time. A t-FT system is called optimally t-SR if it contains t
spare nodes and the minimum number of edges that permit t-step recovery from
all tolerated faults. Necessary and sufficient conditions for G_r to be
optimally t-SR with respect to an arbitrary network G_b are obtained.
Techniques for achieving t-step recovery where t > k are discussed, with
particular reference to networks with restricted node fanout, a constraint
imposed by most microprocessors. A graph transformation technique based on
line graphs is described that simplifies the calculation of k and t.

I. INTRODUCTION

Most previous research in fault-tolerant computer design has been
concerned either with system reliability or fault diagnosis. Other important
aspects of system behavior, notably recovery, have received little attention,
even though they play a central role in fault tolerance. In this paper a
graph theoretical model for fault recovery in complex systems is presented.
The model is particularly applicable to large multiprocessors. Systems
containing thousands of microprocessors have been proposed recently and are
likely to proliferate in the future [1, 2]. It can be expected that many of
these multimicroprocessor systems will have fault tolerance as a major design
goal.

A system is modeled here by a graph whose nodes represent hardware
components, e.g., processors or computers, and whose edges represent
communication links, e.g., switching networks or buses. Similar models have
been used previously in the analysis of computer network reliability [3],
self-diagnosability [4], and fault tolerance [5]. These are all primarily
structural rather than behavioral models, since the graphs used represent the
physical or logical interconnection structure of the system under
consideration. As such they are to be contrasted with models such as Petri
nets or state graphs that are primarily behavioral [6].

II.  RECOVERY  MODEL 

Following [5], a computer system is described by a (facility) graph whose
nodes represent (micro-)processors and whose edges represent communication
paths. All nodes are assumed to be of the same type and to have the same
processing abilities. Edges are assumed to be undirected. A fault is
represented by the removal of nodes and edges from the graph.

Definition 1: A basic graph G_b is a graph that represents the minimum
system configuration needed to perform a certain set of tasks. Thus a basic
system cannot tolerate any faults.

Definition 2: A redundant graph G_r with respect to a basic graph G_b is one
that contains G_b as a proper subgraph. In other words, a proper subgraph
G_b^1 of G_r is isomorphic to G_b, denoted G_b^1 ≅ G_b. G_r is viewed as a
fault-tolerant realization of G_b.

At any time, some subgraph G_b^1 ≅ G_b of G_r represents an active system
engaged in data processing. The remaining part of G_r, denoted G_r - G_b^1,
represents either unused (spare) or unusable (faulty) components. Thus every
node x of G_r can be viewed as having three possible states:

(1) active, that is, x ∈ G_b^1

(2) spare

(3) faulty.

Definition 3 [5]: G_r is k-fault tolerant (k-FT) with respect to G_b if the
removal of any k nodes (and the edges connected to those nodes) from G_r
results in a graph that contains G_b.

It is assumed that the systems of interest contain a mechanism for
continuous self-diagnosis. For example, each node may be regularly tested by
one or more of its neighboring nodes. The precise manner in which diagnosis
is achieved is not of direct interest here. Once a faulty active node is
detected, a process of recovery is initiated which involves replacing the
active subsystem G_b^1 by another subsystem G_b^2 ≅ G_b which contains no
faulty nodes. This means that if G_b^1 contains k faulty nodes, at least k
previously spare nodes must be changed to the active state and must be
included in G_b^2. The manner in which the new active subsystem G_b^2 is
determined constitutes the recovery strategy. In this paper aspects of
recovery are considered that are largely independent of the particular
recovery strategy employed. Note that recovery is being viewed primarily as
a process of reconfiguration around the faulty nodes. The possible changes
of state that a node can experience during system operation are illustrated
in Fig. 1.


Fig.  1.  State  diagram  for  a system  node. 

The recovery process often involves a considerable amount of information
transfer among the system nodes. For example, a spare node s that is being
activated to replace a defective node x must be provided with all information
defining the functions of x, as well as the status of x at the last known
(error-free) check-point. This information is transferred to s from x or
from some other processor that stores the status of x, e.g., a system
supervisor. The number of fault-free nodes whose state or identity is
changed when forming G_b^2 from G_b^1 is taken as a measure of system
recovery time, and leads to the following definition.

Definition 4: G_r is t-step recoverable (t-SR) with respect to G_b if G_r is
t-FT with respect to G_b and G_r can recover from any fault affecting k ≤ t
nodes by changing the state or identity of at most t fault-free nodes.

In many cases recovery can be accomplished by replacing the k faulty nodes
of G_b^1 by k spare nodes. Spare nodes are assumed to be fault-free when
they are first activated; they may subsequently become faulty and require
replacement. It may also be necessary to replace active nodes as well,
either by changing active nodes to spares, or requiring an active node to
assume the identity of another active node. The parameter t defined above is
independent of the recovery strategy R used and the choice of the initial
active configuration G_b^1. It states that some R and G_b^1 exist making
t-step recovery possible for all sequences of up to t faults.

Example 1: Consider the graphs shown in Fig. 2. H_r is clearly 1-FT with
respect to H_b since if G_b^1 comprises nodes B and C, the system can recover
in one step by replacing the faulty node B (C) by the spare node D (A). Note
that if the subgraph consisting of A and B is chosen as G_b^1, recovery
requires two steps in the event of the failure of node B. In this case, the
active node A must also be replaced by one of the spare nodes C or D.


[Figure: graph H_r with nodes A, B, C, D, and basic graph H_b]

Fig. 2. Example of a system H_r that is 1-step recoverable with respect
to H_b.
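The step counts in Example 1 can be reproduced with a small Python sketch. The figure itself is not recoverable, so the sketch assumes that H_r is the path A-B-C-D and that H_b is a single edge joining two nodes, which is consistent with the replacements described in the example. Recovery cost is taken to be the number of positions of H_b that must be re-filled. The name `recovery_steps` is hypothetical.

```python
from itertools import permutations

# Assumption: H_r is the path A-B-C-D and H_b is a single edge (two nodes).
H_R_EDGES = {frozenset(e) for e in [("A", "B"), ("B", "C"), ("C", "D")]}

def recovery_steps(active, failed):
    """Fewest positions of H_b that must be re-filled after `failed` is removed.

    `active` is the ordered pair of nodes currently playing the two roles of
    H_b. Returns None if no fault-free edge survives.
    """
    alive = [v for v in "ABCD" if v != failed]
    costs = [
        sum(new != old for new, old in zip(candidate, active))
        for candidate in permutations(alive, 2)
        if frozenset(candidate) in H_R_EDGES
    ]
    return min(costs) if costs else None

print(recovery_steps(("B", "C"), "B"))   # D takes over from B
print(recovery_steps(("B", "C"), "C"))   # A takes over from C
print(recovery_steps(("A", "B"), "B"))   # both positions move to C and D
```

With B and C active, either failure is repaired in one step. With A and B active, the failure of B forces both positions to move, giving two steps. This shows why the choice of the initial active configuration matters.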

The calculation of the fault tolerance and recovery measures k and t for
arbitrary graphs G_r and G_b is very difficult. In order to find out if G_r
tolerates a given fault F, it is necessary to determine if the graph G_f
representing the faulty system contains a subgraph isomorphic to G_b. This
is the well-known subgraph isomorphism problem. It may be necessary to
examine all subgraphs of G_r that are isomorphic to G_b in order to determine
if G_r is t-SR with respect to G_b. Although the general subgraph isomorphism
problem is computationally very complex, efficient (polynomial time)
algorithms are known for many special classes of graphs, and efficient
heuristic procedures are known for the general case [7].

III. OPTIMAL t-STEP RECOVERY

It is clearly desirable that G_b^1 and G_b^2 should share as many unaltered
nodes as possible in order to minimize the recovery time. The fastest
recovery will be achieved when none of the fault-free active nodes of G_b^1
are affected in forming G_b^2, i.e., exactly t spare nodes are used to
replace the t faulty nodes.

Definition 5: Gr is optimally t-SR with respect to the
n-node system Gb if

(1) Gr is t-step recoverable with respect to Gb;

(2) Gr contains the minimum possible number of
nodes, namely, n + t;

(3) Gr contains the fewest edges among all redundant
systems satisfying (1) and (2).

We now show that every nontrivial connected basic
system Gb has a unique and easily-characterized optimal
t-SR realization, denoted Gb^{o,t}.

Theorem 1: Let Gb^{o,t} be formed from Gb as follows.
Introduce t spare nodes s_1, s_2, ..., s_t and introduce
edges connecting each s_i to every node in Gb and to the
t-1 nodes s_j where i ≠ j. Gr is optimally t-SR with
respect to Gb if and only if Gr ≅ Gb^{o,t}.

Proof: First we show that Gb^{o,t} is t-SR if Gb is the
original active subsystem. Let x be any faulty node.
x can be replaced in one step by any spare node s_i,
since s_i is adjacent to all the nodes that are adjacent
to x. Any sequence of t node failures can be tolerated
similarly, since every node in Gb^{o,t}, including
the t original spares, can be replaced by a spare in
one step. Thus the t spares allow t faulty nodes in Gb
to be replaced in t steps, implying that Gb^{o,t} is t-SR
with respect to Gb.

Let Gr* be any optimal t-SR system. We now
show that Gr* contains a subgraph isomorphic to Gb^{o,t},
hence Gr* ≅ Gb^{o,t}. Let Gb* be the initial active
subsystem of Gr*, so that Gb* ≅ Gb. Let the t nodes of
Gr* - Gb* be designated s_1*, s_2*, ..., s_t*. It remains
to show that each s_i* is adjacent to every other node of
Gr*. Suppose by way of contradiction that s_i* is not
adjacent to y_j*.

There are two possible cases:

Case 1: y_j* ∈ Gb*. (Since Gb is nontrivial, Gb* contains
at least two nodes.) Let y_k* ∈ Gb* and let y_j* and y_k*
be adjacent. Suppose that a sequence of t node failures
occurs affecting y_k* and each of the spare nodes
activated to replace y_k*. At some point s_i* must be used
to replace y_k*, since Gr* is t-SR and only t spare nodes
are available, including s_i*. However, s_i* is not
adjacent to y_j* and y_j*y_k* is an edge of Gb*, hence
s_i* cannot replace y_k*. Consequently Gr* is not t-SR,
a contradiction. Thus s_i* must be adjacent to every node
of Gb*.

Case 2: y_j* ∈ Gr* - Gb*, i.e., y_j* = s_j*. Again consider
a sequence of t node failures. After fewer than t-1
failures either s_i* or s_j* must be activated, say s_j*.
s_j* has at least one neighbor z* which is part of the
currently active system. Suppose that all subsequent
faults involve z* and its replacements. Eventually
s_i* will be the only nonfaulty spare node available to
replace z*. Since s_i* is not adjacent to s_j* (which is
now part of the active subsystem), s_i* cannot take the
role of z*, hence Gr* cannot tolerate the t-fault in
question, a contradiction. Thus s_i* is adjacent to
every node s_j* ≠ s_i*.

We have shown therefore that the spare nodes
of Gr* are connected to every other node of Gr*, so Gr*
and Gb^{o,t} are isomorphic. Hence every optimal t-SR
system is isomorphic to Gb^{o,t}. □

Example 2: Fig. 3a shows a basic graph Ib, and
Fig. 3b shows the corresponding optimal 2-SR system
Ib^{o,2} obtained by the procedure described in
Theorem 1. □
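The procedure of Theorem 1 is simple enough to sketch directly. The Python fragment below is a minimal illustration, not code from the report: the function names and the `("spare", i)` labels are hypothetical, and graphs are passed as node and edge lists.

```python
def optimal_t_sr(nodes, edges, t):
    """Add t spares, each joined to every basic node and to every other spare."""
    nodes = list(nodes)
    spares = [("spare", i) for i in range(1, t + 1)]
    new_edges = [tuple(e) for e in edges]
    for index, spare in enumerate(spares):
        # Join the spare to every node of the basic graph.
        new_edges.extend((spare, v) for v in nodes)
        # Join the spare to each later spare, so each spare pair is added once.
        new_edges.extend((spare, other) for other in spares[index + 1:])
    return nodes + spares, new_edges


def degree(node, edges):
    """Number of edges incident to the given node."""
    return sum(1 for u, v in edges if node in (u, v))
```

Applied to an n-node basic graph, the result has n + t nodes and each spare has degree n + t - 1. For a 12-node cycle with t = 1 the single spare therefore has degree 12.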


Fig. 3. (a) A basic graph Ib. (b) The corresponding
optimal 2-SR graph Ib^{o,2}.

Optimal t-SR systems can also be characterized
in terms of their clique graphs. Let Kn denote
a complete graph of n nodes, i.e., an n-node graph
containing all possible edges.

Definition 6 [8]: A clique of a graph G is a maximal
complete subgraph of G. The clique graph K(G) of
G is the intersection graph formed by the cliques of
G, i.e., there is a one-to-one correspondence between
the cliques of G and the nodes of K(G), and
two nodes in K(G) are adjacent if and only if the
intersection of the corresponding cliques in G is
nonempty.
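Definition 6 can be made concrete with a short Python sketch. It assumes a graph is supplied as a dictionary mapping each node to the set of its neighbors, and it enumerates maximal cliques with the basic Bron-Kerbosch recursion; the function names are hypothetical and the approach is a choice made for illustration only.

```python
def maximal_cliques(adj):
    """Return all maximal cliques of the graph as frozensets of nodes."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_graph(adj):
    """Return the cliques of G and the edges of K(G) as pairs of clique indices."""
    cliques = maximal_cliques(adj)
    edges = [(i, j)
             for i in range(len(cliques))
             for j in range(i + 1, len(cliques))
             if cliques[i] & cliques[j]]
    return cliques, edges


def is_complete(num_nodes, edges):
    """True if a graph with num_nodes nodes and the given edges is complete."""
    return len(edges) == num_nodes * (num_nodes - 1) // 2
```

For the four-node path A-B-C-D the sketch finds three two-node cliques, two of which are disjoint, so the clique graph is not complete. For a triangle it finds a single clique, so the clique graph is a single node.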

Theorem 2: If Gr is an optimal t-SR realization of
some Gb, then K(Gr) is complete.

Proof: Suppose K(Gr) is not complete. Then Gr
has two cliques C_1 and C_2 which have no node in
common. There exists a spare node in the initial
configuration of Gr which is not adjacent to some
node in C_1 or C_2 (a spare cannot belong to both
cliques and, since cliques are maximal, it is not adjacent
to every node of a clique that excludes it). Hence Gr
cannot be isomorphic to Gb^{o,t} and so, by Th. 1, it is
not optimally t-SR, a contradiction. Hence K(Gr) must be
complete. □

Fig. 4 shows the clique graphs for Hr and Ib^{o,2}
from Figs. 2 and 3, respectively. Hr has three
cliques isomorphic to K2. Since two of these
cliques are disjoint, K(Hr) is not complete. Ib^{o,2}
has four cliques isomorphic to K4, hence K(Ib^{o,2}) ≅
K4. Note that the optimal 1-SR graph for Hb in
Fig. 2 is K3, and K(K3) = K1.


Fig. 4. Clique graphs (a) for Hr of Fig. 2,
(b) for Ib^{o,2} of Fig. 3.

IV.  GENERALIZED  t-STEP  RECOVERY 

The optimal t-SR designs considered in the preceding
section have the disadvantage that the maximum
node degree in Gb^{o,t} can be very large. If Gb
contains n nodes then the spare nodes s_i in Gb^{o,t} have
degree n+t-1, which is the maximum possible degree
in an (n+t)-node graph. Node degree corresponds to
the number of input/output ports of a processor, or
its fanout, and this is usually limited by physical
considerations. In the case of microprocessors, the
number of parallel data paths that can be connected
to the microprocessor is severely restricted by
integrated circuit pin limitations. Thus it is of
interest to consider nonoptimal redundant systems in
which node degree is limited.

In the definition of t-SR given earlier it was assumed
that the system was required to tolerate up to
t faults. We now give a more general definition in
which the number of faults tolerated and the number
of recovery steps are distinguished.

Definition 7: Gr is k-fault tolerant t-step recoverable
(k-FT/t-SR) with respect to Gb if Gr can recover
from up to k faults in Gr in at most t recovery steps,
that is, by changing node states or identities at most
t times.

In general, k ≤ t. When k = t the system will also
be called simply t-SR, conforming with the earlier
definition.

Example 3: Fig. 5 shows three different 1-FT realizations
of the basic graph C12, which is the cycle
with 12 nodes. Fig. 5a shows the optimal 1-FT/1-SR
graph as defined by Th. 1. Note that the central
"spare" node has degree 12. Fig. 5b shows another
1-FT/1-SR version of C12 which contains two spare
nodes and so is nonoptimal; however, its maximum
node degree is only 6. The graph in Fig. 5c is the
1-FT realization of C12 which, as proven in [5],
contains the minimum number of edges. It also has the
smallest possible node degrees; however, it is 8-SR.
Thus there are fundamental tradeoffs involving the
number of spares, the maximum node degree, and
the maximum number of recovery steps t.


Fig. 5. Three 1-FT/t-SR realizations of C12.

As noted in §2, the computation of k and t for
arbitrary k-FT/t-SR systems is very difficult.
There are two possible ways in which this computational
complexity problem can be avoided.

(1) We can restrict our attention to graphs with
properties such as structural regularity which
simplify fault analysis.

(2) We can attempt to transform the given graphs
into graphs that are easy to analyze, and are
such that the fault tolerance properties of the
original graphs can be obtained from the
transformed graphs.


In the remainder of this paper, we examine a
special class of graphs called line graphs for which
fault analysis is relatively easy. Moreover, an arbitrary
graph can readily be converted into a line graph
by the addition of nodes and edges [8]. First we
define and characterize line graphs.

Definition 8 [8]: The line graph of a graph G, denoted
L(G), is a graph whose nodes are in one-to-one
correspondence with the edges of G. Two nodes in L(G)
are adjacent if and only if the corresponding edges of
G are adjacent. If H is a line graph, then there exists
a graph G such that L(G) is isomorphic to H. G is
called the root graph of H and will be denoted by L^{-1}(H).

It is obvious that every graph has a line graph;
however, not every graph is the line graph of
another graph. Very efficient algorithms
are known for determining if G is a line graph and, if
it is, for generating its root graph [9]. Line graphs
have been studied extensively; the following theorem
summarizes their major characteristics. Let K1,n
denote the star graph [8], which contains n+1 nodes and
n edges, with n of the nodes joined to the remaining
node.

Theorem 3 [3]: Properties of line graphs.

(a) If G1 and G2 are any two nontrivial connected
graphs except K3 and K1,3, then L(G1) is isomorphic to
L(G2) if and only if G1 is isomorphic to G2.

(b) G is isomorphic to L(G) if and only if G is a
cycle.

(c) If G is a line graph then the edges of G can
be partitioned into complete subgraphs {C_i} in such a
way that no node lies in more than two of the subgraphs,
and there is a one-to-one correspondence between {C_i}
and the nodes of L^{-1}(G).

(d) Line graphs of regular graphs with degree d
are regular with degree 2(d-1).
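Definition 8 and parts (b) and (d) of Theorem 3 can be checked on small cases with a few lines of Python. This is a minimal sketch that assumes simple graphs given as edge lists; the function names are hypothetical.

```python
from itertools import combinations


def line_graph(edges):
    """Return (nodes, edges) of L(G); each node of L(G) stands for one edge of G."""
    nodes = [frozenset(e) for e in edges]
    # Two nodes of L(G) are adjacent when the corresponding edges share an endpoint.
    l_edges = [(a, b) for a, b in combinations(nodes, 2) if a & b]
    return nodes, l_edges


def degrees(nodes, edges):
    """Map each node to its degree."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg
```

The line graph of a 5-node cycle again has 5 nodes and 5 edges, all of degree 2, in agreement with (b). The line graph of K4, which is regular of degree 3, has 6 nodes of degree 2(3-1) = 4, in agreement with (d).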

Example 4: Fig. 6 illustrates Th. 3c. The complete
subgraphs {C_i} in the line graph L(J) correspond to the
nodes {x_i} in its root graph J. □

Def. 8 implies that we can define a function L
that transforms a graph into its line graph, and a
function L^{-1} that transforms a line graph into its root
graph. The following notation is also useful:

L^{i+1}(G) = L(L^{i}(G))

L^{-(i+1)}(G) = L^{-1}(L^{-i}(G))

where i ≥ 1. Menon [10] has shown that L^{-1}(G) has
fewer nodes than G if G is not a cycle or a path, hence
L^{-1}(G) is usually simpler than G. We will now show
that if a redundant system Gr is a line graph, many of
its properties pertaining to fault tolerance can be
determined with less computation from L^{-1}(Gr).

Theorem 4: If Gr is k-FT with respect to Gb, then
L(Gr) is k-FT with respect to L(Gb).

Proof: Th. 3c implies that a one-to-one correspondence
exists between the nodes {x_i} of Gr and a subset
{C_i} of the complete subgraphs of L(Gr), where the
{C_i} include all nodes of L(Gr). Suppose a k-fault in
L(Gr) effectively eliminates a set S of k nodes to form
a new graph H. Let C_1, C_2, ..., C_j be any set of j ≤ k
members of {C_i} that contain S, and let H* be the
result of removing C_1, C_2, ..., C_j from L(Gr). There are
j nodes x_1, x_2, ..., x_j in Gr such that x_i corresponds to
C_i in L(Gr) for i = 1, 2, ..., j. If Gr* is the result of
removing these j nodes from Gr, then L(Gr*) ≅ H*, and
H* is a subgraph of H. Since Gr is k-FT with respect to
Gb and j ≤ k, we conclude that Gb ⊆ Gr*. Hence

L(Gb) ⊆ L(Gr*) ⊆ H

implying that H, which represents L(Gr) with a k-fault
present, contains a subgraph isomorphic to
L(Gb). It follows that L(Gr) is k-FT with respect to
L(Gb). □

Fig. 6. (a) A graph J. (b) Its line graph L(J), showing
the complete subgraphs {C_i} of L(J)
that correspond to the nodes {x_i} of J.

Note  that  the  converse  of  Th.  4 is  false. 

Theorem 5: If Gr is k-SR with respect to Gb, then
L(Gr) is k-FT/(2kd-k)-SR with respect to L(Gb),
where d is the largest degree of any node in Gr.

Proof: As in the proof of Th. 4, every set of k nodes
in L(Gr) is contained in j ≤ k complete subgraphs C =
{C_1, C_2, ..., C_j} which correspond to nodes X =
{x_1, x_2, ..., x_j} in Gr. Since Gr is k-SR, Gr can
recover from the removal of X in at most k steps, i.e.,
by changing the state or identity of at most k
fault-free nodes. Every clique in L(Gr) contains at most
d nodes. Hence L(Gr) can recover from a k-fault in
C by deactivating at most kd-k fault-free nodes in C,
i.e., by removing C, and by changing the states or
identities of an additional kd nodes to replace C.
Hence L(Gr) can recover from a k-fault in at most
2kd-k steps. □

Example 5: Figs. 7a and 7b show two line graphs
Pr and Pb. Consider the problem of determining
values of k and t such that Pr is k-FT/t-SR with
respect to Pb. The problem is greatly simplified if
we replace Pr and Pb by their root graphs L^{-1}(Pr) and
L^{-1}(Pb), which appear in Figs. 7c and 7d, respectively.
By Th. 1, L^{-1}(Pr) is optimally 1-SR with respect
to L^{-1}(Pb). The maximum node degree of L^{-1}(Pr) is
three, hence by Th. 5, Pr is 1-FT/5-SR with respect
to Pb. □


Fig. 7. (a) The redundant graph Pr. (b) The basic
graph Pb. (c) The root graph L^{-1}(Pr). (d)
The root graph L^{-1}(Pb).

Theorems 4 and 5 can also be used to construct
k-FT/t-SR systems with nodes of lower degree than
the corresponding optimal t-SR systems, in the case of
regular basic graphs. (A graph is regular if all its
nodes have the same degree d.) The reduction in the
node degree of Gr becomes more apparent as d
increases. The following example illustrates this.

Example 6: Suppose a 1-FT realization of a certain
regular graph Qb is required where Qb has 20 nodes of
degree 8. Fig. 8a shows L^{-1}(Qb). (We omit the
diagram for Qb because of its complexity.) Using Th. 1
the graph L^{-1}(Qr) shown in Fig. 8b can be constructed.
L^{-1}(Qr) is an (optimal) 1-SR realization of L^{-1}(Qb).
Now construct the line graph Qr of L^{-1}(Qr), which by
Th. 5 is a 1-FT/15-SR realization of the original
system Qb (2kd-k = 15 for k = 1 and d = 8). Qr has 28
nodes, and 12 is its maximum node degree. While Qr has
far more spare nodes than the optimal 1-SR realization
of Qb, the latter contains nodes with degree 20. □
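The node counts and degrees quoted in this example can be reproduced on a stand-in graph. The report does not list the edges of the root graph, so the sketch below assumes a hypothetical one: the circulant graph on 8 nodes with offsets 1, 2 and 4, which is regular of degree 5 and has 20 edges. Function and variable names are likewise illustrative only.

```python
from itertools import combinations


def line_graph(edges):
    """Return (nodes, edges) of L(G); each node of L(G) stands for one edge of G."""
    nodes = [frozenset(e) for e in edges]
    l_edges = [(a, b) for a, b in combinations(nodes, 2) if a & b]
    return nodes, l_edges


def degree_range(nodes, edges):
    """Return (minimum degree, maximum degree) of the graph."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return min(deg.values()), max(deg.values())


# Hypothetical root graph standing in for the one in Fig. 8a.
root_nodes = list(range(8))
root_edges = sorted({tuple(sorted((i, (i + off) % 8)))
                     for i in root_nodes for off in (1, 2, 4)})

# Basic system: the line graph of the root graph.
qb_nodes, qb_edges = line_graph(root_edges)

# Th. 1 with t = 1: one spare joined to every node of the root graph.
spare_edges = [("s", v) for v in root_nodes]

# Redundant system: the line graph of the augmented root graph.
qr_nodes, qr_edges = line_graph(root_edges + spare_edges)
```

Under this assumption the basic system has 20 nodes of degree 8 and the redundant system has 28 nodes with maximum degree 12, matching the figures in the example. A different 5-regular root graph on 8 nodes would give the same counts, since they depend only on the node degrees.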
Fig. 8. (a) The graph L^{-1}(Qb). (b) An optimal
1-SR realization of L^{-1}(Qb).